Introduction

In a previous post, I began an investigation into the Motor Vehicle Collisions - Crashes dataset. This personal project is meant to help me (a) practice my geospatial data skills, (b) replicate the spatial dependency model used by Morris et al. (2019), and (c) potentially extend that model to include a time component, which would theoretically allow me to model whether the congestion pricing zone has impacted vehicle collision rates in 2025. In the interest of breaking this project down into digestible and achievable chunks, I have decided that this, part 2 of 3, will focus on purely spatial modeling. Fortunately, Mitzi Morris published this guide to the topic, which I’ll be following.

Data Loading, Joining, Basic Cleaning

I outline most of the requisite data sources for the project in the previous post, although for this post I’m adding some tract-level traffic/built environment data (fragmentation index and traffic volume), as well as some ACS covariates (percentage of commuters that use public transit, population, median income, median age). My process here is to summarize the point-level crash data to the census tract level and then join it with the covariates.

One note: the fragmentation index and traffic variables provided by Morris only cover 2095 tracts (out of 2315 listed). There were also some ACS variables with NA values in a couple rows. Because I’m not trying to save the world, just figure out how to model things, I simply removed these from the analysis. In a different world (or perhaps the future), I would either find more complete data, determine which tracts I’m actually okay with filtering out (e.g., parks?), or interpolate the data in some way.

Below, please find a data table with the final dataset that I’m using.

# Loading NYC Census Tracts: https://data.cityofnewyork.us/City-Government/2020-Census-Tracts/63ge-mke6/about_data 
nyc_tracts <- read.csv("data/2020_Census_Tracts_20250606.csv", stringsAsFactors = FALSE) %>%
  rename("geometry" = "the_geom")
nyc_tracts <- st_as_sf(nyc_tracts, wkt = "geometry", crs = 4326)
nyc_tracts <- nyc_tracts %>%
  mutate(
    area_sqm = Shape_Area / 27878400
  ) %>%
  select(
    geometry,
    BoroCT2020,
    BoroCode,
    BoroName,
    area_sqm,
    GEOID
  )

# Loading collisions data: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data 
filter <- c(0, NA)

crashes <- read.csv("data/Motor_Vehicle_Collisions_-_Crashes_20250605.csv") %>%
  select(CRASH.DATE,
         LATITUDE,
         LONGITUDE,
         NUMBER.OF.PERSONS.INJURED,
         NUMBER.OF.PERSONS.KILLED,
         NUMBER.OF.PEDESTRIANS.INJURED,
         NUMBER.OF.PEDESTRIANS.KILLED) %>%
  filter(! LATITUDE %in% filter,
         ! LONGITUDE %in% filter) %>%
  rename(
    "date" = "CRASH.DATE",
    "lat" = "LATITUDE",
    "long" = "LONGITUDE",
    "persons_inj" = "NUMBER.OF.PERSONS.INJURED",
    "persons_death" = "NUMBER.OF.PERSONS.KILLED",
    "ped_inj" = "NUMBER.OF.PEDESTRIANS.INJURED",
    "ped_death" = "NUMBER.OF.PEDESTRIANS.KILLED"
  ) %>%
  mutate(
    date = mdy(date)
    ) %>%
  st_as_sf(coords = c("long","lat"), crs = 4326)

# Grabbing some census variables via Tidycensus, using Keyring for API Key privacy
tidycensus_api_key <- key_get(service = "tidycensus_API", username = "my_tidycensus")
census_api_key(tidycensus_api_key)

census_vars <- get_acs(state = "NY",
                       county = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
                       geography = "tract",
                       variables = c(medincome = "B19013_001",
                                     population = "B01003_001",
                                     median_age = "B01002_001",
                                     transport_baseline = "B08301_001",
                                     transport_pubtransit = "B08301_010"),
                       geometry = FALSE,
                       keep_geo_vars = FALSE,
                       year = 2022,
                       output = "wide"
                       ) %>%
  mutate(
    GEOID = as.numeric(GEOID),
    median_income = medincomeE,
    population = populationE,
    median_age = median_ageE,
    prop_pubtransit = transport_pubtransitE / transport_baselineE
  ) %>%
  select(
    GEOID,
    median_income,
    population,
    median_age,
    prop_pubtransit
  )

# Associating CP Zone, Pre/Post, Treatment, Borough, and Census Tract w/ Observations
crashes <- crashes %>%
  st_join(nyc_tracts, join = st_within) %>%
  filter(! is.na(area_sqm))

# Aggregating/summarizing data to census tract level, no time component
crashes_tract <- crashes %>%
  group_by(BoroCT2020) %>%
  summarize(tot_crashes = n(),
            area = mean(area_sqm)) %>%
  ungroup() %>%
  st_drop_geometry()

# Getting Fragmentation Index from: https://github.com/mitzimorris/geomed_2024/blob/main/data/nyc_study.geojson

frag_data = st_read(file.path("data", "nyc_study.geojson"), quiet = TRUE) %>%
  st_drop_geometry() %>%
  select(BoroCT2010, frag_index, traffic) %>%
  mutate(
    BoroCT2010 = as.numeric(BoroCT2010)
  )

# Joining everything together, selecting only variables that I want
crashes_tracts_geo <- nyc_tracts %>%
  right_join(crashes_tract, 
             by = "BoroCT2020") %>%
  left_join(census_vars,
            by = "GEOID") %>%
  left_join(frag_data,
            by = c("BoroCT2020" = "BoroCT2010")) %>%
  select(! c(area)) %>% 
  filter(! if_any(everything(), is.na))

# Interactive Data Table for Display
crashes_dt <- crashes_tracts_geo %>%
  st_drop_geometry() %>%
  mutate(
    area_sqm = round(area_sqm, 3),
    prop_pubtransit = round(prop_pubtransit, 3)
  )

datatable(crashes_dt,
          extensions = 'Buttons',
          filter = "top",
          rownames = FALSE,
          options = list(
            autoWidth = TRUE,
            scrollX = TRUE
          ),
  class = 'compact',
  escape = FALSE
) %>%
  formatStyle(
    columns = names(crashes_dt),
    `white-space` = "nowrap",
    `height` = "1.5em",
    `line-height` = "1.5em"
  )

More Plots

Though I made some plots in the previous post, I’ll make some more here to hopefully illustrate (a) characteristics of the outcome variable and (b) visually explore how the predictor variables might be related to the outcome.

Histogram of Collisions per Square Mile

ggplot(data = crashes_tracts_geo, aes(x = tot_crashes)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 500, 
                 color = "lightblue",
                 fill = "lightblue",
                 alpha = 0.75) + 
  geom_density(color = "#FFC600", linewidth = 1) + 
  theme_bw() + 
  labs(title = "Distribution Total Crashes per Census Tract (2024-25)",
       x = "Number of Crashes",
       y = "Density")

This distribution has a mean of 50, standard deviation of 35, and variance of 1225, which is not great news for the Poisson distribution (which assumes that mean = variance). We’ll see how that does

Choropleth Maps

Not all the predictors have a super clear relationship with the outcome variable, but it’s a good idea to include them anyway.

Map of Collisions

It’s worth noting that, in my previous post, I looked at crashes per square mile, whereas this analysis simply uses the raw counts. I believe this is (a) because the Poisson model is for discrete count data and (b) the dependency models adjust for population. I’m not 100% clear on why they’re not using the per square mile measure, though.

crash_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "tot_crashes",
              fill.scale = tm_scale_intervals(values = "brewer.yl_or_rd",
                                              style = "jenks"),
              fill.legend = tm_legend(title = "Total Crashes, 2024-25"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right") 

crash_map

Predictor Maps

Median Income // Median Age

income_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "median_income",
              fill.scale = tm_scale_intervals(values = "brewer.greens",
                                              style = "jenks"),
              fill.legend = tm_legend(title = "Median Income"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

medianage_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "median_age",
              fill.scale = tm_scale_intervals(values = "brewer.blues",
                                              style = "jenks"),
              fill.legend = tm_legend(title = "Median Age"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

tmap_arrange(income_map,
             medianage_map,
             nrow = 1, ncol = 2)

Proportion Using Public Transit to Commute // Population

prop_pubtransit_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "prop_pubtransit",
              fill.scale = tm_scale_intervals(values = "brewer.purples",
                                              style = "jenks",
                                              value.na = "grey"),
              fill.legend = tm_legend(title = "Proportion that Commute Using Public Transit"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

pop_per_sqm_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "population",
              fill.scale = tm_scale_intervals(values = "brewer.yl_gn",
                                              style = "jenks"),
              fill.legend = tm_legend(title = "Population"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

tmap_arrange(prop_pubtransit_map,
             pop_per_sqm_map,
             nrow = 1, ncol = 2)

Fragmentation Index // Traffic

frag_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "frag_index",
              fill.scale = tm_scale_intervals(values = "brewer.pu_or",
                                              style = "quantile",
                                              midpoint = 0,
                                              value.na = "grey"),
              fill.legend = tm_legend(title = "Fragmentation Index"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

traffic_map <- tm_shape(crashes_tracts_geo) + 
  tm_polygons(fill = "traffic",
              fill.scale = tm_scale_intervals(values = "brewer.reds",
                                              style = "quantile",
                                              midpoint = 0,
                                              value.na = "grey"),
              fill.legend = tm_legend(title = "Traffic Volume"),
              lwd = 0.3) +
  tm_layout(legend.outside = TRUE, 
            legend.outside.position = "right")

tmap_arrange(frag_map,
             traffic_map,
             nrow = 1, ncol = 2)

Moving from a Map to Spatial Dependencies

Ok–so we’ve visualized ad nauseum. Now, we’ve got to set up a spatial weights matrix. Essentially, something that tells us which of the tracts neighbor each other. In a different setting, we may want to vary weights (greater weight to tracts closer to the central tract, decreasing weights as we move away), but here we just set the weight equal to 1 if the tracts are neighbors and 0 if they are not.

Computing Properties of the Node Graph

In the outputs above and below, we can see that there are 2 regions with 0 links (islands), and 4 disjoint subgraphs, meaning networks that there are no paths between (Staten Island is a disjoint subgraph). This was done using the Queen’s case method (think of the squares that a Queen can reach on a chessboard–not only directly bordering neighbors, but diagonals as well.

Cleaning Neighbor Graph Objects

Exactly what was needed in this department was a touch confusing to me. Essentially, I got the idea that (a) I should examine whether existing links in the graph made sense (e.g., if there’s a link across water, does it make sense to keep it in from a pedestrian-focused context?) and then (b) make sure the final graph is fully connected, because that’s a requirement of the ICAR model I want to use. In Morris’s example it seems that part (a) was mostly an exercise in spatial data manipulation, whereas the actual dataset used for the final analysis is the original graph with all of its cross-water connections and a few added links to make it a single component. The code below achieves this, and checks my work.

Checking My Work

cat('is symmetric? ', is.symmetric.nb(nyc_nbs_connected), '\n')
## is symmetric?  TRUE
summary(nyc_nbs_connected)
## Neighbour list object:
## Number of regions: 1956 
## Number of nonzero links: 10800 
## Percentage nonzero weights: 0.2822839 
## Average number of links: 5.521472 
## 7 disjoint connected subgraphs
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  19  52 144 287 461 469 313 150  39  12   6   2   2 
## 19 least connected regions:
## 55 257 259 521 581 582 644 835 839 893 1210 1249 1402 1490 1576 1591 1605 1635 1893 with 1 link
## 2 most connected regions:
## 1718 1785 with 13 links
cat('nodes per component')
## nodes per component
table(n.comp.nb(nyc_nbs_connected)$comp.id)
## 
##    1 
## 1956

We can see here that the graph is symmetric, and that there is only a single component listed, with 1956 nodes. One confusing thing is that there are still 7 disjoint subgraphs in the output. I genuinely don’t know why this is and I cannot figure it out, so if you’re reading this and you do know, please email me.

Basic Model Building Workflow

Finally, the fun part. Now, I know that model building should be an extremely stepwise and iterative process–skipping steps is likely to result in issues. That said, I want to balance this with post length, so I’m roughly planning to include one model + output for each type of model I’m building.

Simple Poisson Model

The Model

poisson_1_model_file = file.path('stan', 'poisson_1.stan')
cat(readLines(poisson_1_model_file), sep="\n")
## data {
##   int<lower=0> N;
##   array[N] int<lower=0> y; // count outcomes
##   vector<lower=0>[N] E; // exposure
##   int<lower=1> K; // num covariates
##   matrix[N, K] xs; // design matrix
## }
## transformed data {
##   vector[N] log_E = log(E);
##   
##   // center continuous predictors 
##   vector[K] means_xs;  // column means of xs before centering
##   matrix[N, K] xs_centered;  // centered version of xs
##   for (k in 1:K) {
##     means_xs[k] = mean(xs[, k]);
##     xs_centered[, k] = xs[, k] - means_xs[k];
##   }
## }
## parameters {
##   real beta0; // intercept
##   vector[K] betas; // covariates
## }
## model {
##   y ~ poisson_log(log_E + beta0 + xs_centered * betas);
##   beta0 ~ std_normal();
##   betas ~ std_normal();
## }
## generated quantities {
##   real beta_intercept = beta0 - dot_product(means_xs, betas);  // adjust intercept
##   array[N] int y_rep;
##   vector[N] log_lik;
##   {
##     vector[N] eta = log_E + beta0 + xs_centered * betas;   // centered data
##     y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
##     for (n in 1:N) {
##       log_lik[n] = poisson_log_lpmf(y[n] | eta[n]);
##     }
##   }
## }

So, this is not my model, it’s the simple Poisson model that Morris uses. So, rather than understanding it because I wrote it, I’m going to understand it because I explain it to you in this section. First, the data:

  1. N: The number of observations (in this case, the number of census tracts)
  2. y: The outcome measure, number of crashes in 2024-25
  3. E: The “exposure” variable, which is the population of each tract. I believe this is separated from the rest of the predictor matrix, xs, so that it can be log transformed and then used as a separate component in the linear predictor.
  4. K: The number of predictor variables
  5. xs: A matrix containing N rows and K columns, which holds the predictor variables

The transformed data block takes the log of population (E), which is a generally good principle for data that’s constrained to be all positive. It also means that the coefficient goes to the multiplicative scale, which is nice for interpretation.

In the parameters block, Morris defines an intercept, beta0, and a vector of coefficients, betas, for the covariates. The beta0 coefficient is split out so it can be applied to the log_E data.

The model specifies that y comes from a Poisson distribution, and gives beta0 and betas simple standard normal distributions (mean 0, sd 1).

Lastly, the generated quantities block. In Morris’s words, this is for use in model checking and model comparison. Y_rep is used to hold replicated data. Morris uses some optimization in this block that I don’t entirely (or even really) understand, but this will be used to conduct posterior predictive checks. That is, we can simulate fake data and check their characteristics (mean, sd, quantiles) to see if the model does a good job. Morris uses leave-one-out cross validation, and uses the output from the loo() function–Expected Log Predictive Density (ELPD)–which gives the log-probability of new data given the model. A higher ELPD is better.

In the guide materials, Morris also makes some improvements to the model. Most notably, mean-centering the predictors. I ran the model without these adjustments and then added them. Output was, naturally, pretty similar. The improved model ran in only about 3 seconds per chain on my computer, as compared to 10-12 seconds for the original.

Running the Model and Examining the Output

design_mat <- as.data.frame(crashes_tracts_geo) %>%
  select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
  mutate(median_income = log(median_income),
         traffic = log(traffic))

pois_data <- list(
  N = nrow(crashes_tracts_geo),
  y = as.integer(crashes_tracts_geo$tot_crashes),
  E = as.integer(crashes_tracts_geo$population),
  K = ncol(design_mat),
  xs = design_mat
)
poisson_model_1 <- cmdstan_model(poisson_1_model_file)
set.seed(50)

poisson_fit_1 <- poisson_model_1$sample(data=pois_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
## 
## Chain 1 finished in 2.0 seconds.
## Chain 2 finished in 2.0 seconds.
## Chain 3 finished in 2.1 seconds.
## Chain 4 finished in 2.3 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 2.1 seconds.
## Total execution time: 2.5 seconds.
pois_summary <- poisson_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]')
pois_summary_subset <- pois_summary[pois_summary$variable %in% vars, ]
numeric_cols <- sapply(pois_summary_subset, is.numeric)
pois_summary_subset[numeric_cols] <- round(pois_summary_subset[numeric_cols], 3)

print(pois_summary_subset)
## # A tibble: 7 × 10
##   variable         mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>           <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta0          -4.43  -4.43  0.003 0.003 -4.43  -4.42   1       2937.    2513.
## 2 betas[1]       -0.196 -0.196 0.025 0.025 -0.237 -0.156  1.00    1649.    2157.
## 3 betas[2]       -0.025 -0.025 0.007 0.007 -0.036 -0.013  1.00    1714.    1762.
## 4 betas[3]       -0.006 -0.006 0     0     -0.007 -0.005  1       3763.    3166.
## 5 betas[4]        0.004  0.004 0.001 0.001  0.002  0.007  1.00    4238.    2839.
## 6 betas[5]        0.263  0.263 0.003 0.003  0.257  0.268  1       2838.    2529.
## 7 beta_intercept -6.45  -6.45  0.09  0.088 -6.60  -6.31   1.00    1640.    2038.

Ok, so here we have some output. First, the rhat values are all 1.00, which indicates the model mixed properly. At a basic level, the negative coefficients on the first three predictors (prop_pubtransit, median_income, median_age), indicate that higher levels of public transit use, median income, and median age are all associated with lower counts of crashes in a census tract. Conversely, higher fragmentation index (beta[5]) and traffic volume (beta[6]) are associated with higher counts of crashes. These all pass the sniff test.

Posterior Predictive Checks

This is where we’ll use the y_rep predicted values from the generated quantities block of the Stan model. We’re hoping that the characteristics (mean, sd, quantiles) of y_rep match the true data as closely as possible. Morris included the helper functions ppc_central_interval and ppc_y_yrep_overlay in the “utils_dataviz.R” file.

y_rep_pois <- as_draws_matrix(poisson_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_pois, pois_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 360, pct: 18.40"

The output above shows that only 18% of the observed data points fall within the 50% predicted interval, which is not great. The graph below illustrates this. It does seem that the model doesn’t handle low or high values all that well.

ppc_y_yrep_overlay(y_rep_pois, pois_data$y,
                     'Poisson model PPC\ny (blue dot) vs. y_rep (yellow 50% central interval, grey full extent)')

Adding Random Effects to the Model

Morris makes the point that the data are overdispersed (we saw that coming when looking at the distribution of the outcome variable earlier), meaning that the variance is higher than we expect (“we” being the users of the single-parameter Poisson model). The recommended solution is to add a simple random effects component to the model, which accounts for per-tract heterogeneity. This is similar to adding indicator variables for each tract, but assumes these varying intercepts come from a distribution.

The Model

poisson_2_model_file = file.path('stan', 'poisson_2.stan')
cat(readLines(poisson_2_model_file), sep="\n")
## data {
##   int<lower=0> N;
##   array[N] int<lower=0> y; // count outcomes
##   vector<lower=0>[N] E; // exposure
##   int<lower=1> K; // num covariates
##   matrix[N, K] xs; // design matrix
## }
## transformed data {
##   vector[N] log_E = log(E);
##   
##   // center continuous predictors 
##   vector[K] means_xs;  // column means of xs before centering
##   matrix[N, K] xs_centered;  // centered version of xs
##   for (k in 1:K) {
##     means_xs[k] = mean(xs[, k]);
##     xs_centered[, k] = xs[, k] - means_xs[k];
##   }
## }
## parameters {
##   real beta0; // intercept
##   vector[K] betas; // covariates
##   vector[N] theta; // heterogeneous random effects
##   real<lower=0> sigma; // random effects variance
## }
## model {
##   y ~ poisson_log(log_E + beta0 + xs_centered * betas + theta * sigma);
##   beta0 ~ std_normal();
##   betas ~ std_normal();
##   theta ~ std_normal();
##   sigma ~ normal(0, 5);
## }
## generated quantities {
##   real beta_intercept = beta0 - dot_product(means_xs, betas);  // adjust intercept
##   array[N] int y_rep;
##   vector[N] log_lik;
##   {
##     vector[N] eta = log_E + beta0 + xs_centered * betas + theta * sigma;   // centered data
##     y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
##     for (n in 1:N) {
##       log_lik[n] = poisson_log_lpmf(y[n] | eta[n]);
##     }
##   }
## }

So, the difference in the new model is that it includes additional parameters theta and sigma, which define the distribution that the varying intercepts are drawn from.

Running the Model and Examining the Output

# Unchanged from previous model
design_mat <- as.data.frame(crashes_tracts_geo) %>%
  select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
  mutate(median_income = log(median_income),
         traffic = log(traffic))

pois_data <- list(
  N = nrow(crashes_tracts_geo),
  y = as.integer(crashes_tracts_geo$tot_crashes),
  E = as.integer(crashes_tracts_geo$population),
  K = ncol(design_mat),
  xs = design_mat
)
poisson_model_2 <- cmdstan_model(poisson_2_model_file)
set.seed(50)

poisson_fit_2 <- poisson_model_2$sample(data=pois_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
## 
## Chain 2 finished in 11.0 seconds.
## Chain 3 finished in 15.6 seconds.
## Chain 1 finished in 15.8 seconds.
## Chain 4 finished in 18.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 15.1 seconds.
## Total execution time: 18.2 seconds.
pois_summary <- poisson_fit_2$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]', 'sigma')
pois_summary_subset <- pois_summary[pois_summary$variable %in% vars, ]
numeric_cols <- sapply(pois_summary_subset, is.numeric)
pois_summary_subset[numeric_cols] <- round(pois_summary_subset[numeric_cols], 3)

print(pois_summary_subset)
## # A tibble: 8 × 10
##   variable         mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>           <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta0          -4.47  -4.47  0.013 0.014 -4.49  -4.45   1.01     284.     632.
## 2 betas[1]       -0.043 -0.04  0.113 0.112 -0.234  0.139  1.02     203.     424.
## 3 betas[2]        0.002  0.002 0.031 0.033 -0.051  0.052  1.01     254.     501.
## 4 betas[3]       -0.003 -0.003 0.002 0.002 -0.006  0      1.01     272.     378.
## 5 betas[4]        0.016  0.016 0.006 0.006  0.008  0.026  1.01     243.     459.
## 6 betas[5]        0.265  0.265 0.015 0.016  0.239  0.29   1.00     208.     663.
## 7 sigma           0.593  0.592 0.01  0.01   0.576  0.61   1.01     398.     818.
## 8 beta_intercept -7.01  -7.03  0.4   0.422 -7.62  -6.31   1.01     230.     567.

Alright, the rhat values look fine. Some of the coefficients are slightly different as a result of adding the varying intercepts, but I’m gonna breeze past interpretation for now. Let’s check the model with posterior predictive checks.

Posterior Predictive Checks

y_rep_pois <- as_draws_matrix(poisson_fit_2$draws("y_rep"))
ppc_central_interval(y_rep_pois, pois_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1947, pct: 99.54"

Hmmmm…this is fishy, too. We want about 50% of the observations to fall within the 50% interval, not almost 100%. I guess this model is way over-fit, which probably makes it a good candidate for some leave-one-out cross-validation.

ppc_y_yrep_overlay(y_rep_pois, pois_data$y,
                     'Poisson model PPC\ny (blue dot) vs. y_rep (yellow 50% central interval, grey full extent)')

Leave-One-Out Cross-Validation (LOO-CV)

So, LOO-CV should allow us to compare the models on how they perform on unseen data.

loo_1_pois <- loo(poisson_fit_1$draws("log_lik"), save_psis = TRUE)
loo_2_pois <- loo(poisson_fit_2$draws("log_lik"), save_psis = TRUE)

loo_compare(loo_1_pois, loo_2_pois)
##        elpd_diff se_diff 
## model2      0.0       0.0
## model1 -16787.4    1034.9

The output above identifies the varying-intercepts model as a better fit to the data, which we kind of knew already. I am interested in those Pareto k diagnostic issues:

print("Model 1")
## [1] "Model 1"
pareto_k_table(loo_1_pois)
## Pareto k diagnostic values:
##                          Count Pct.    Min. ESS
## (-Inf, 0.7]   (good)     1955  99.9%   124     
##    (0.7, 1]   (bad)         1   0.1%   <NA>    
##    (1, Inf)   (very bad)    0   0.0%   <NA>
print("Model 2")
## [1] "Model 2"
pareto_k_table(loo_2_pois)
## Pareto k diagnostic values:
##                          Count Pct.    Min. ESS
## (-Inf, 0.7]   (good)      331  16.9%   147     
##    (0.7, 1]   (bad)      1286  65.7%   <NA>    
##    (1, Inf)   (very bad)  339  17.3%   <NA>

So, the second model does fit the data better, but the output in the second table tells me that upwards of 80% of the data points in my sample have unreliable LOO estimates. Morris mentions that this model handles overdispersion, but fails to distinguish between variance due to the spatial structure of the data and purely heterogeneous variance. If you think about it, this sort of makes sense. Model 2 doesn’t allow the tracts around a given tract to have any impact on the estimates for that tract, which is going to cause unreliability.

Intrinsic Conditional Auto-Regressive (ICAR) Modeling

The ICAR model is a model of spatial dependency. Intuitively, it’s pretty simple: the value of a region often depends on its neighbors. In a census tract situation, this is a totally reasonable intuition–it’s pretty likely that census tracts next to each other are similar, whether that be in terms of who lives there, what the roads are like, etc. We’ll include an ICAR prior in our Stan model–accounting for the correlation between spatial units by linking the estimates spatially. I’m sure Morris or Andrew Gelman would be horrified by that imprecise explanation, but that’s why I’m not a PhD student.

We can return to that spatially-linked data from earlier, finally. Since I spent the time building a fully-connected NYC graph, I’m going to model the whole city.

The ICAR Model

icar_1_model_file = file.path('stan', 'icar_1.stan')
cat(readLines(icar_1_model_file), sep="\n")
## data {
##   int<lower=0> N;
##   array[N] int<lower=0> y; // count outcomes
##   vector<lower=0>[N] E; // exposure
##   int<lower=1> K; // num covariates
##   matrix[N, K] xs; // design matrix
## 
##   // spatial structure
##   int<lower = 0> N_edges;  // number of neighbor pairs
##   array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent
## }
## transformed data {
##   vector[N] log_E = log(E);
##   // center continuous predictors 
##   vector[K] means_xs;  // column means of xs before centering
##   matrix[N, K] xs_centered;  // centered version of xs
##   for (k in 1:K) {
##     means_xs[k] = mean(xs[, k]);
##     xs_centered[, k] = xs[, k] - means_xs[k];
##   }
## }
## parameters {
##   real beta0; // intercept
##   vector[K] betas; // covariates
##   sum_to_zero_vector[N] phi; // spatial effects
##   real<lower=0> sigma; // overall spatial variance
## }
## model {
##   y ~ poisson_log(log_E + beta0 + xs_centered * betas + phi * sigma);
##   beta0 ~ std_normal();
##   betas ~ std_normal();
##   sigma ~ std_normal();
##   target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);  // ICAR prior
## }
## generated quantities {
##   real beta_intercept = beta0 - dot_product(means_xs, betas);  // adjust intercept
##   array[N] int y_rep;
##   {
##     vector[N] eta = log_E + beta0 + xs_centered * betas + phi * sigma;
##     y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
##   }
## }

Running the Model and Examining the Output

nbs_adj = nbs_to_adjlist(nyc_nbs_connected)

design_mat <- as.data.frame(crashes_tracts_geo) %>%
  select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
  mutate(median_income = log(median_income),
         traffic = log(traffic))

icar_data <- list(
  N = nrow(crashes_tracts_geo),
  y = as.integer(crashes_tracts_geo$tot_crashes),
  E = as.integer(crashes_tracts_geo$population),
  K = ncol(design_mat),
  xs = design_mat,
  N_edges = ncol(nbs_adj),
  neighbors = nbs_adj
)
icar_model_1 <- cmdstan_model(icar_1_model_file)
set.seed(50)

icar_fit_1 <- icar_model_1$sample(data=icar_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
## 
## Chain 2 finished in 14.3 seconds.
## Chain 1 finished in 17.6 seconds.
## Chain 4 finished in 21.5 seconds.
## Chain 3 finished in 23.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 19.3 seconds.
## Total execution time: 24.0 seconds.
icar_summary <- icar_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]', 'sigma')
icar_summary_subset <- icar_summary[icar_summary$variable %in% vars, ]
numeric_cols <- sapply(icar_summary_subset, is.numeric)
icar_summary_subset[numeric_cols] <- round(icar_summary_subset[numeric_cols], 3)

print(icar_summary_subset)
## # A tibble: 8 × 10
##   variable         mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>           <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta0          -4.47  -4.47  0.004 0.004 -4.48  -4.46   1       6136.    2895.
## 2 betas[1]       -0.377 -0.375 0.129 0.131 -0.591 -0.165  1.01     225.     552.
## 3 betas[2]        0.028  0.028 0.043 0.045 -0.043  0.1    1.02     127.     280.
## 4 betas[3]        0.006  0.006 0.002 0.002  0.002  0.01   1.02     272.     576.
## 5 betas[4]        0.046  0.046 0.008 0.008  0.032  0.06   1.02     145.     224.
## 6 betas[5]        0.285  0.284 0.016 0.015  0.26   0.311  1.02     237.     466.
## 7 sigma           1.14   1.14  0.021 0.021  1.11   1.18   1.01     392.     853.
## 8 beta_intercept -7.69  -7.70  0.537 0.552 -8.57  -6.81   1.02     171.     370.

Well then, the rhat values are still pretty low! Again, not going to fuss with coefficient interpretation just yet. Let’s check the fit with the same posterior tools as before, as well as an ICAR correlation matrix.

Checking the Fit

The ICAR Correlation Matrix

phi_icar <- as_draws_matrix(icar_fit_1$draws("phi"))
plot_icar_corr_matrix(phi_icar, 'ICAR correlation matrix')

I won’t lie, the correlation matrix that Morris put together (admittedly with only Brooklyn data) looks a bit “better” than mine. I gather that we’re supposed to see greater correlation along the diagonal, since the components are listed in roughly geographic order. This does seem like a plot, though, that’s heavily dependent on the order of your data…so I’m not super confident that mine is set up entirely the same as Morris’s.

Posterior Predictive Checks

Before the checks, it’s worth noting that apparently LOO doesn’t work for ICAR, which I suppose makes sense, given that removing an observation would impact the spatial dependency network.

We’ll replicate the steps from the basic Poisson model to check our fit.

y_rep_icar <- as_draws_matrix(icar_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_icar, icar_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1931, pct: 98.72"
ppc_y_yrep_overlay(y_rep_icar, icar_data$y,
                     'Poisson + ICAR model PPC\ny (blue dot) vs. y_rep (orange 50% central interval, grey full extent)')

So, we can see that the predictive coverage is still comparable to the more basic Poisson models that I ran. 98% coverage is still pretty high, but I believe this is because there’s only one observation per tract, so there’s not a great understanding of within-tract variability.

Morris basically makes the point that the ICAR component effectively accounts for the spatial dependence in the data while providing similar predictive coverage to the random effects Poisson model. The next step is to combine the two into a BYM2 model.

Blending the Approaches with the Besag York Mollié (BYM) Model

The Besag York Mollié (BYM) model has both a spatial dependency parameter and a random effects parameter. It uses a mixing parameter to get a blend between the two, depending on how much of the variation seen in the outcome variable is spatial.

The BYM2 Model

bym2_1_model_file = file.path('stan', 'bym2_1.stan')
cat(readLines(bym2_1_model_file), sep="\n")
## data {
##   int<lower=0> N;
##   array[N] int<lower=0> y; // count outcomes
##   vector<lower=0>[N] E; // exposure
##   int<lower=1> K; // num covariates
##   matrix[N, K] xs; // design matrix
## 
##   // spatial structure
##   int<lower = 0> N_edges;  // number of neighbor pairs
##   array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent
##   
##   real tau; // scaling factor
## }
## transformed data {
##   vector[N] log_E = log(E);
##   // center continuous predictors 
##   vector[K] means_xs;  // column means of xs before centering
##   matrix[N, K] xs_centered;  // centered version of xs
##   for (k in 1:K) {
##     means_xs[k] = mean(xs[, k]);
##     xs_centered[, k] = xs[, k] - means_xs[k];
##   }
## }
## parameters {
##   real beta0; // intercept
##   vector[K] betas; // covariates
##   real<lower=0, upper=1> rho; // proportion of spatial variance
##   sum_to_zero_vector[N] phi;  // spatial effects
##   vector[N] theta; // heterogeneous random effects
##   real<lower = 0> sigma;  // scale of combined effects
## }
## transformed parameters {
##   vector[N] gamma = sqrt(1 - rho) * theta + sqrt(rho * inv(tau)) * phi;  // BYM2
## }
## model {
##   y ~ poisson_log(log_E + beta0 + xs_centered * betas + gamma * sigma);
##   rho ~ beta(0.5, 0.5);
##   target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]); // ICAR prior
##   beta0 ~ std_normal();
##   betas ~ std_normal();
##   theta ~ std_normal();
##   sigma ~ std_normal();
## }
## generated quantities {
##   real beta_intercept = beta0 - dot_product(means_xs, betas);  // adjust intercept
##   array[N] int y_rep;
##   {
##     vector[N] eta = log_E + beta0 + xs_centered * betas + gamma * sigma;
##     y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
##   }
## }

This model is still largely the same as the ones I’ve been running, with a few changes:

  1. Tau is a new component in the data block. Tau is used to rescale the variance of Phi (the spatial covariance matrix) to be approximately 1, in order to match the variance on the non-spatial random effects.
  2. Rho is a new parameter which estimates the proportion of variance explained by spatial effects. Note that the parameters block has both a spatial effects parameter (phi) and a random effects parameter (theta).
  3. The Transformed Parameters block calculates gamma, which is the blended mix of spatial and non-spatial effects.
  4. The Model and Generated Quantities blocks have been updated to properly utilize the new additions.

Setting up the Data Inputs

This is almost identical to what I’ve been doing, with the addition now of tau.

nbs_adj = nbs_to_adjlist(nyc_nbs_connected)

design_mat <- as.data.frame(crashes_tracts_geo) %>%
  select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
  mutate(median_income = log(median_income),
         traffic = log(traffic))

tau = get_scaling_factor(nyc_nbs_connected)

bym2_data <- list(
  N = nrow(crashes_tracts_geo),
  y = as.integer(crashes_tracts_geo$tot_crashes),
  E = as.integer(crashes_tracts_geo$population),
  K = ncol(design_mat),
  xs = design_mat,
  N_edges = ncol(nbs_adj),
  neighbors = nbs_adj,
  tau = tau
)

Prior Predictive Checks

This is new to me, but Morris recommends simulating data directly from the prior. Essentially, it seems like we can look at the simulated data, which demonstrates the range of likely data based on the priors, and see if they seem well-conditioned. We’re going to simulate the parameters gamma, phi, theta, and rho using a stripped-down version of the model above, now excluding the covariates and outcome variable.

bym2_gamma_ppc_model_file = file.path('stan', 'bym2_gamma_ppc.stan')
cat(readLines(bym2_gamma_ppc_model_file), sep="\n")
## data {
##   int<lower=0> N;
##   // spatial structure
##   int<lower = 0> N_edges;  // number of neighbor pairs
##   array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent
## 
##   real tau; // scaling factor
## }
## parameters {
##   real<lower=0, upper=1> rho; // proportion of spatial variance
##   sum_to_zero_vector[N] phi;  // spatial effects
##   vector[N] theta; // heterogeneous random effects
##   real<lower = 0> sigma;  // scale of combined effects
## }
## transformed parameters {
##   vector[N] gamma = sqrt(1 - rho) * theta + sqrt(rho * inv(tau)) * phi;  // BYM2
## }
## model {
##   rho ~ beta(0.5, 0.5);
##   target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]); // ICAR prior
##   theta ~ std_normal();
##   sigma ~ std_normal();
## }

With that model, we can run and take a look at the estimated values of phi, theta, and rho. In the output below, you can see that theta and rho are close to 1 and 0.5, which is what we want. I think the estimate for phi is a bit low…which I’m not entirely sure what to do with. I guess that we’re meant to be seeing whether the variance for phi and theta are comparable/on similar scales, which they largely are.

bym2_gamma_ppc_model <- cmdstan_model(bym2_gamma_ppc_model_file)
set.seed(50)

bym2_gamma_ppc_fit <- bym2_gamma_ppc_model$sample(data=bym2_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
## 
## Chain 1 finished in 17.7 seconds.
## Chain 3 finished in 18.0 seconds.
## Chain 2 finished in 19.1 seconds.
## Chain 4 finished in 19.9 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 18.7 seconds.
## Total execution time: 20.1 seconds.
phi <- as_draws_matrix(bym2_gamma_ppc_fit$draws("phi"))
phi_var_by_region <- apply(sqrt(1/tau) * phi, 2, var)  # variance across draws for each region
print(paste("mean variance of phi scaled:", mean(phi_var_by_region)))
## [1] "mean variance of phi scaled: 0.890123773373277"
theta <- as_draws_matrix(bym2_gamma_ppc_fit$draws("theta"))
theta_var_by_region <- apply(theta, 2, var)  # variance across draws for each region
print(paste("mean variance of theta:", mean(theta_var_by_region)))
## [1] "mean variance of theta: 1.0004212323147"
rho <- as_draws_array(bym2_gamma_ppc_fit$draws("rho"))
print(paste("mean estimate of rho:", mean(rho)))
## [1] "mean estimate of rho: 0.498749615924182"

Let’s Run the Damn Thing

bym2_model_1 <- cmdstan_model(bym2_1_model_file)
set.seed(50)

bym2_fit_1 <- bym2_model_1$sample(data=bym2_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
## 
## Chain 4 finished in 63.1 seconds.
## Chain 3 finished in 90.5 seconds.
## Chain 1 finished in 99.2 seconds.
## Chain 2 finished in 119.6 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 93.1 seconds.
## Total execution time: 119.8 seconds.
bym2_summary <- bym2_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'sigma', 'rho')
bym2_summary_subset <- bym2_summary[bym2_summary$variable %in% vars, ]
numeric_cols <- sapply(bym2_summary_subset, is.numeric)
bym2_summary_subset[numeric_cols] <- round(bym2_summary_subset[numeric_cols], 3)

print(bym2_summary_subset)
## # A tibble: 8 × 10
##   variable         mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>           <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
## 1 beta0          -4.47  -4.47  0.011 0.011 -4.49  -4.45   1.00    5453.    2671.
## 2 betas[1]       -0.272 -0.271 0.134 0.134 -0.491 -0.052  1.00    3424.    3062.
## 3 betas[2]        0.016  0.017 0.041 0.041 -0.051  0.081  1       3146.    3063.
## 4 betas[3]        0.003  0.003 0.002 0.002  0      0.007  1       3932.    3183.
## 5 betas[4]        0.044  0.044 0.008 0.008  0.03   0.058  1.00    2603.    2458.
## 6 rho             0.771  0.774 0.04  0.039  0.702  0.831  1.01     341.     769.
## 7 sigma           0.901  0.899 0.047 0.048  0.824  0.981  1.01     310.     850.
## 8 beta_intercept -7.64  -7.64  0.511 0.512 -8.46  -6.80   1.00    3251.    3405.

Alrighty–we’re working with nice rhat values again! And, interestingly, the coefficient of 0.77 for rho means that these data are highly spatially correlated (it’s the proportion of spatial variance in the combined spatial/non-spatial random effects). Next up, some posterior predictive checks and the correlation matrix.

Checking the Fit

The BYM2 Spatial Correlation Matrix

phi_bym2 <- as_draws_matrix(bym2_fit_1$draws("phi"))
plot_icar_corr_matrix(phi_bym2, 'BYM2 spatial correlation matrix')

Ugh–kind of hard to read. There’s definitely more red along the diagonal, which is good. Morris’s graph showed the same pattern. I’m going to be honest, I’m starting to wonder whether my issue is actually that I have significantly more tracts than they do, so this might just be harder to read, absent any statistical difference.

Anyway, let’s do the posterior predictive checks.

Posterior Predictive Checks

We’ll replicate the steps from the other models to check our fit.

y_rep_bym2 <- as_draws_matrix(bym2_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_bym2, bym2_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1942, pct: 99.28"
ppc_y_yrep_overlay(y_rep_bym2, bym2_data$y,
                     'Poisson + BYM2 model PPC\ny (blue dot) vs. y_rep (orange 50% central interval, grey full extent)')

The posterior predictive coverage for this model is just as good as the others, if not better. So, where have I gotten in this post? It’s all become a bit abstract. My understanding of the purpose of the BYM2 model is that it’s effective at handling both spatial and non-spatial random effects. I think this will continue to be important when I move to phase III of the project: introducing monthly counts to the data structure, rather than having a single summed count across everything. This addition of more hierarchical structure to the data (monthly effects, more variation within tracts) will allow me to hopefully understand trends over time.

And on that bombshell, I think it’s time to wrap up before this gets even longer.